bug on log generation?

  • Jump to comment-1
    marcos@f10.com.br - 2022-08-07T13:56:10+00:00
    I have to fix my log files because their content is not properly formatted. I'm using version 14.4, but this also happened when I was using version 11. It happens only when the statement is huge, either because it is a long sequence of updates in a WITH or DO statement, or because I'm updating a bytea field. This kind of statement appears in the log dozens of times a day, but only 2 or 3 of them are wrongly formatted.

    One example is a WITH statement with dozens of inserts and updates in it:

    2022-08-06 10:05:10.822 -03,"user_0566","db",1956432,"10.158.0.17:56098",62ee66ee.1dda50,1,"SELECT",2022-08-06 10:04:46 -03,266/383801,229907216,LOG,00000,"duration: 1735.107 ms execute PRSTMTST155939968154/PORTALST155939968154: /*Cobranca.ImportaCobranca*/ WITH Receber1 AS (UPDATE fin_Re> NNLiquidado1 AS (UPDATE fin_ReceberNossoNumero SET Status = 7 WHERE NossoNumero = any($${90062164}$$)), --statement continues, with some more dozens of update/inserts NNDesconsiderado48 AS (UPDATE fin_recebernossonumero SET status = 9 WHERE receber_id = 104201 AND status = 1 AND nossonumero <> 90086321), NNExcluir48 AS (UPDATE fin_recebernossonumero SET status = 4 WHERE receber_id = 104201 AND status = any($IN${2,3}$IN$) AND nossonumero <> 90086321 RETURNING recebernossonumero_id), Baixa48 AS (INSERT INTO fin_ReceberBaixa(Historico, Desconto, Valor, Lancamento) VALUES ($vs$Paga2022-08-06 10:07:07.505 -03,"user_0566","db",1956432,"10.158.0.17:56098",62ee66ee.1dda50,2,"idle",2022-08-06 10:04:46 -03,266/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"2022062701adriana.aguiar","client backend",,0
    2022-08-06 10:07:07.507 -03,"user_0328","db",1957035,"10.158.0.17:57194",62ee6730.1ddcab,1,"idle",2022-08-06 10:05:52 -03,410/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"2022062701tmk06.madureira","client backend",,0

    If you search for "$vs$Paga2022-08-06 10:07:07.505" you'll see that "$vs$Paga" is still part of the first statement, but "2022-08-06 10:07:07.505" is the start of the next statement, and some characters of the previous statement are missing.

    Another example is just one update with a large bytea field in it:

    2022-08-06 15:57:46.955 -03,"user_0591","db",2103042,"10.158.0.17:43662",62eeb9aa.201702,1,"INSERT",2022-08-06 15:57:46 -03,49/639939,230013197,LOG,00000,"duration: 11.012 ms execute PRSTMTST1612483842/PORTALST1612483842: WITH upsert AS ( UPDATE sys_var SET varBlob = $1 WHERE name = $2 RETURNING *) INSERT INTO sys_var (Name, varBlob) SELECT $3 , $4 WHERE NOT EXISTS (SELECT * FROM upsert)","parameters: $1 = '\x3c3f786d6 --field content continues $2 = '/users/39818/XMLConfig', $3 = '/users/39818/XMLConfig', $4 = '\x3c3f786d6 --field content continues e4445583e2d313c2f47524f555045445f494e4445583e32022-08-06 15:58:42.436 -03,"user_0591","db",2103042,"10.158.0.17:43662",62eeb9aa.201702,2,"idle",2022-08-06 15:57:46 -03,49/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"","client backend",,-4199849316459484872
    2022-08-06 15:58:42.436 -03,"user_0143","db",2103112,"10.158.0.17:43794",62eeb9bf.201748,1,"idle",2022-08-06 15:58:07 -03,44/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"2022062701joyceb.l@hotmail.com","client backend",,0

    Here, in "4445583e32022-08-06 15:58:42.436", the bytea content "4445583e" was still being written and the next statement started with "32022-08-06 15:58:42.436".

    Obviously, because the previous line is not terminated correctly, I cannot import the log files properly, so I have to edit those log files before importing them.

    Thanks,
    Marcos
    • Jump to comment-1
      tgl@sss.pgh.pa.us - 2022-08-07T14:12:10+00:00
      Marcos Pegoraro <marcos@f10.com.br> writes:
      > I have to fix log files because its content is not properly formatted,

      What mechanism are you using to store the log?  If syslog is involved,
      it's reputed to drop data under load.

      regards, tom lane
      • Jump to comment-1
        marcos@f10.com.br - 2022-08-07T14:56:44+00:00
        It's csvlog only.

        Best regards,

        On Sun, Aug 7, 2022 at 11:12, Tom Lane <tgl@sss.pgh.pa.us> wrote:
        > Marcos Pegoraro <marcos@f10.com.br> writes:
        > > I have to fix log files because its content is not properly formatted,
        >
        > What mechanism are you using to store the log?  If syslog is involved,
        > it's reputed to drop data under load.
        >
        > regards, tom lane
        • Jump to comment-1
          andres@anarazel.de - 2022-08-07T16:47:01+00:00
          Hi,

          On 2022-08-07 11:56:44 -0300, Marcos Pegoraro wrote:
          > it's csvlog only

          How are you running postgres? If the logger process runs into trouble it might
          write to stderr.

          Is there a chance your huge statements would make you run out of space?

          Greetings,

          Andres Freund
          • Jump to comment-1
            marcos@f10.com.br - 2022-08-08T11:34:08+00:00
            > How are you running postgres? If the logger process runs into trouble it might
            > write to stderr.
            >
            > Is there a chance your huge statements would make you run out of space?

            Well, I don't think it is an out-of-space problem, because logging doesn't stop; it just splits that message. As you can see, the next message is logged properly. And that statement is not so huge; these statements are no more than 10 or 20 kB. And as I said, these statements occur dozens of times a day, but only once or twice are they incorrectly logged.

            An additional piece of info: the split message has an out-of-order log time. At that time the log file was getting 2 or 3 entries per second, and that message was 1 or 2 minutes later. It seems like the event occurs now but is stored a minute or two later.

            Thanks,
            Marcos
            • Jump to comment-1
              andrew@dunslane.net - 2022-08-08T13:59:10+00:00
              On 2022-08-08 Mo 07:34, Marcos Pegoraro wrote:
              >> How are you running postgres? If the logger process runs into trouble it might
              >> write to stderr.
              >>
              >> Is there a chance your huge statements would make you run out of space?
              >
              > Well, I don't think it is an out-of-space problem, because logging doesn't stop; it just splits that message. As you can see, the next message is logged properly. And that statement is not so huge; these statements are no more than 10 or 20 kB. And as I said, these statements occur dozens of times a day, but only once or twice are they incorrectly logged.
              >
              > An additional piece of info: the split message has an out-of-order log time. At that time the log file was getting 2 or 3 entries per second, and that message was 1 or 2 minutes later. It seems like the event occurs now but is stored a minute or two later.

              It looks like a failure of the log chunking protocol, with long messages being improperly interleaved. I don't think we've had reports of such a failure since commit c17e863bc7 back in 2012, but maybe my memory is failing.

              What platform is this on? Is it possible that on some platform the chunk size we're using is not doing an atomic write? syslogger.h says:

              #ifdef PIPE_BUF
              /* Are there any systems with PIPE_BUF > 64K?  Unlikely, but ... */
              #if PIPE_BUF > 65536
              #define PIPE_CHUNK_SIZE  65536
              #else
              #define PIPE_CHUNK_SIZE  ((int) PIPE_BUF)
              #endif
              #else                           /* not defined */
              /* POSIX says the value of PIPE_BUF must be at least 512, so use that */
              #define PIPE_CHUNK_SIZE  512
              #endif

              cheers

              andrew

              --
              Andrew Dunstan
              EDB: https://www.enterprisedb.com
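As a side note on the chunking protocol Andrew describes: each backend writes its log messages to the syslogger pipe in chunks no larger than PIPE_CHUNK_SIZE, relying on POSIX's guarantee that a write of at most PIPE_BUF bytes to a pipe is atomic, so chunks from different processes cannot be torn apart mid-chunk. The sketch below only illustrates that idea; the struct layout, field names, and CHUNK_SIZE value are simplified assumptions, not PostgreSQL's actual PipeProtoHeader definition.

    /* Illustrative sketch of a pipe chunking protocol; names and layout are
     * simplified assumptions, not PostgreSQL's actual definitions. */
    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define CHUNK_SIZE 512              /* stand-in for PIPE_CHUNK_SIZE */

    typedef struct
    {
        pid_t   pid;                    /* sender, lets the collector reassemble */
        int     len;                    /* payload bytes in this chunk */
        char    is_last;                /* 't' if final chunk of the message */
        char    data[CHUNK_SIZE - sizeof(pid_t) - sizeof(int) - 1];
    } ChunkSketch;

    /* Split a message into writes no larger than PIPE_BUF so each write() is
     * atomic and chunks from concurrent senders cannot interleave mid-chunk. */
    static void
    send_message(int pipefd, const char *msg, size_t msglen)
    {
        while (msglen > 0)
        {
            ChunkSketch c;
            size_t      n = msglen > sizeof(c.data) ? sizeof(c.data) : msglen;

            c.pid = getpid();
            c.len = (int) n;
            c.is_last = (n == msglen) ? 't' : 'f';
            memcpy(c.data, msg, n);

            /* NOTE: a failed or partial write here would be silently lost,
             * which is one of the failure modes discussed in this thread. */
            (void) write(pipefd, &c, offsetof(ChunkSketch, data) + n);

            msg += n;
            msglen -= n;
        }
    }

If any single write exceeded PIPE_BUF, or if a write were dropped without notice, the collector could interleave or truncate messages, which is the kind of corruption reported above.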
              • Jump to comment-1
                marcos@f10.com.br - 2022-08-08T15:07:59+00:00
                > What platform is this on? Is it possible that on some platform the chunk
                > size we're using is not doing an atomic write?

                Until last year we were on Ubuntu 16.04 and Postgres 11 with the latest minor update. This January we changed to Ubuntu 20.04 and Postgres 14, now updated to 14.4.

                But the problem occurred on both the old and new OS and Postgres versions. Right now I opened the current log file and there are 20 or 30 of these statements and all of them are fine; maybe the problem comes back tomorrow, maybe this afternoon.

                Thanks,
                Marcos
                • Jump to comment-1
                  andrew@dunslane.net - 2022-08-08T15:36:17+00:00
                  On 2022-08-08 Mo 11:07, Marcos Pegoraro wrote:
                  >> What platform is this on? Is it possible that on some platform the chunk
                  >> size we're using is not doing an atomic write?
                  >
                  > Until last year we were on Ubuntu 16.04 and Postgres 11 with the latest minor update. This January we changed to Ubuntu 20.04 and Postgres 14, now updated to 14.4.
                  >
                  > But the problem occurred on both the old and new OS and Postgres versions. Right now I opened the current log file and there are 20 or 30 of these statements and all of them are fine; maybe the problem comes back tomorrow, maybe this afternoon.

                  OK, we really need a repeatable test if possible. Perhaps a pgbench run with lots of concurrent runs of some very long query would do the trick.

                  cheers

                  andrew

                  --
                  Andrew Dunstan
                  EDB: https://www.enterprisedb.com
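As a rough illustration of the reproduction Andrew suggests (not something anyone in the thread actually ran): generate a query whose logged text is larger than PIPE_CHUNK_SIZE and run it from many concurrent pgbench clients while full statement logging is enabled. The file name, sizes, and client counts below are arbitrary choices for the example.

    # Assumptions: logging_collector = on, log_destination = 'csvlog', and
    # log_min_duration_statement = 0 (or log_statement = 'all') so every
    # statement text lands in the log.

    # Generate a script whose single query text is ~100 kB (an arbitrary size,
    # chosen only to exceed PIPE_CHUNK_SIZE so the message must be chunked).
    printf "SELECT '%s';\n" "$(head -c 100000 /dev/zero | tr '\0' 'x')" > longquery.sql

    # Hammer it from many concurrent clients for ten minutes.
    pgbench -n -c 50 -j 4 -T 600 -f longquery.sql yourdb

    # Afterwards, check whether the csvlog file still imports cleanly
    # (e.g. COPY into the postgres_log table described in the documentation);
    # a failure to import would indicate an interleaved or truncated entry.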
                  • Jump to comment-1
                    marcos@f10.com.br - 2022-08-08T18:34:27+00:00
                    > OK, we really need a repeatable test if possible. Perhaps a pgbench run
                    > with lots of concurrent runs of some very long query would do the trick.

                    OK, I can do it, but... strangely, that error usually occurs at random times, sometimes at 08:00, sometimes at 19:00, and the server is busier between 10:00 and 16:00. Is it enough if I cron some of those queries to run every second? What exactly do you expect to see in the log files?
              • Jump to comment-1
                tgl@sss.pgh.pa.us - 2022-08-08T14:32:22+00:00
                Andrew Dunstan <andrew@dunslane.net> writes:
                > What platform is this on? Is it possible that on some platform the chunk
                > size we're using is not doing an atomic write?

                Another idea is that some of the write() calls are failing --- elog.c doesn't check for that. Eyeing the POSIX spec for write(), I wonder if somehow the pipe has gotten set into O_NONBLOCK mode and we're not retrying EAGAIN failures.

                regards, tom lane
                • Jump to comment-1
                  andres@anarazel.de - 2022-08-08T16:02:04+00:00
                  Hi,

                  On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
                  > Andrew Dunstan <andrew@dunslane.net> writes:
                  > > What platform is this on? Is it possible that on some platform the chunk
                  > > size we're using is not doing an atomic write?
                  >
                  > Another idea is that some of the write() calls are failing --- elog.c
                  > doesn't check for that.

                  I was suspicious of those as well. It might be a good idea to at least write such failures to stderr, otherwise it's just about impossible to debug. Not that stderr will always point anywhere useful...

                  I can imagine that a system under heavy memory pressure might fail writing, if there's a lot of writes in a row or such.

                  Greetings,

                  Andres Freund
                  • Jump to comment-1
                    tgl@sss.pgh.pa.us - 2022-08-08T16:19:22+00:00
                    Andres Freund <andres@anarazel.de> writes:
                    > On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
                    >> Another idea is that some of the write() calls are failing --- elog.c
                    >> doesn't check for that.

                    > I was suspicious of those as well. It might be a good idea to at least write
                    > such failures to stderr, otherwise it's just about impossible to debug. Not
                    > that stderr will always point anywhere useful...

                    Uh ... what we are talking about is a failure to write to stderr. It's not likely that adding more output will help.

                    Having said that, looping on EAGAIN seems like a reasonably harmless change. Whether it will help here is really hard to say, though.

                    regards, tom lane
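To make the retry-on-EAGAIN idea concrete: a write to the logging pipe that fails with EAGAIN (a non-blocking pipe that is momentarily full) or EINTR could be retried rather than dropped. The helper below is only a sketch of that idea, not the actual elog.c change discussed here; the function name and retry policy are invented for illustration.

    #include <errno.h>
    #include <unistd.h>

    /* Hypothetical helper: keep writing one chunk until it is fully sent,
     * retrying EINTR and EAGAIN instead of silently losing the data. */
    static ssize_t
    write_full_retry(int fd, const char *buf, size_t len)
    {
        size_t  done = 0;

        while (done < len)
        {
            ssize_t rc = write(fd, buf + done, len - done);

            if (rc < 0)
            {
                if (errno == EINTR || errno == EAGAIN)
                    continue;       /* transient: try the same write again */
                return -1;          /* hard error: caller decides what to do */
            }
            done += (size_t) rc;
        }
        return (ssize_t) done;
    }

Since POSIX guarantees that a write of at most PIPE_BUF bytes to a pipe is never partial, retrying on EAGAIN repeats the whole chunk rather than splitting it; the loop above tracks partial progress anyway for generality.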
                    • Jump to comment-1
                      andres@anarazel.de - 2022-08-08T16:38:22+00:00
                      Hi,

                      On 2022-08-08 12:19:22 -0400, Tom Lane wrote:
                      > Andres Freund <andres@anarazel.de> writes:
                      > > On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
                      > >> Another idea is that some of the write() calls are failing --- elog.c
                      > >> doesn't check for that.
                      >
                      > > I was suspicious of those as well. It might be a good idea to at least write
                      > > such failures to stderr, otherwise it's just about impossible to debug. Not
                      > > that stderr will always point anywhere useful...
                      >
                      > Uh ... what we are talking about is a failure to write to stderr.
                      > It's not likely that adding more output will help.

                      I forgot that we don't preserve the original stderr in some other fd, likely because the logger itself still has it open and can use write_stderr().

                      Greetings,

                      Andres Freund
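For context, "preserving the original stderr in some other fd" would mean duplicating the descriptor before stderr is redirected into the logging pipe, so a process keeps an emergency output path. The sketch below is a generic illustration of that mechanism only; it assumes nothing about how PostgreSQL would actually wire it up, and the function names are invented.

    #include <string.h>
    #include <unistd.h>

    static int saved_stderr = -1;

    /* Keep a copy of the original stderr before it gets redirected elsewhere. */
    static void
    save_original_stderr(void)
    {
        saved_stderr = dup(STDERR_FILENO);
    }

    /* Emergency output path that bypasses the redirected (and possibly
     * broken) stderr, if a saved copy exists. */
    static void
    emergency_write(const char *msg)
    {
        if (saved_stderr >= 0)
            (void) write(saved_stderr, msg, strlen(msg));
    }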